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Abstract — This paper presents a new construction of 
Maximum-Distance Separable (MDS) Reed-Solomon erasure 
codes based on Fermat Number Transform (FNT). Thanks 
to FNT, these codes support practical coding and decoding 
algorithms with complexity 0(n log n), where n is the number of 
symbols of a codeword. An open-source implementation shows 
that the encoding speed can reach 150Mbps for codes of length 
up to several 10,000s of symbols. These codes can be used as the 
basic component of the Information Dispersal Algorithm (IDA) 
system used in a several P2P systems. 

I. Introduction and Related Work 

Erasure and network coding concepts were recently used in 
several kinds of networks in order to improve the throughput 
and the reliability of the systems. 

In distributed storage systems like P2P or RAID-based sys- 
tems, erasure codes can improve the degree of reliability and 
the persistence of the data following the Information Dispersal 
Algorithm (IDA) proposed by Rabin (TJ. This scheme is used 
is a large set of P2P systems proposals. Moreover, another 
benefit of erasure codes in this context is to decrease the 
average downloading time by increasing the data diversity in 
the network 0. 

Erasure codes have also applications in multimedia trans- 
missions where the packet losses can not be recovered by clas- 
sic retransmissions because of real-time constraints. Similarly, 
the reliability of multicast transmissions can be improved by 
codes since one repair packet can allow different receivers to 
recover different losses. 

Basically, an erasure code generates n encoding symbols 
from a set of k source symbols (k < n). The set of erasure 
codes can be split into two categories: MDS codes and non- 
MDS codes. The main property of MDS codes is that the k 
source symbols can be recovered from any set of k encoding 
symbols among the n ones. For non-MDS codes, (1 + e)k 
symbols are needed on average to recover the k source 
symbols, where e > 0. In counterpart, the encoding/decoding 
complexities of such non-MDS codes as Raptor |[3j or LDPC 
codes (4] is much lower than with MDS codes. 

MDS codes can be built from Vandermonde matrices, like 
Reed-Solomon codes 0, 0, or from Cauchy matrices Q, 
J8), 0. The optimal erasure recovery capability of MDS 
codes is necessary in many applications like distributed storage 
(peer-to-peer distributed storage or RAID-based systems) or 
multimedia/multicast transmissions. However, the quadratic 
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encoding/decoding complexities of the codes mentioned re- 
strain their practical use to applications needing short length 
codes (in practical, up to 255 symbols). 

The objective of our work is to design new MDS codes that 
support fast (sub-quadratic) operations. 

A recent work J9j proposed a Reed-Solomon encod- 
ing/decoding algorithm based on Walsh transforms with com- 
plexity 0(qlog 2 q), where q = 2 m is the size of the finite 
field. In order to go one step further towards sub-quadratic 
algorithms, we propose here Reed-Solomon codes defined over 
a finite field F p , where p is a Fermat prime, i. e. p = 2 2 +1. 
The interest of this choice is twofold: first, as noted in 0, 
the addition and multiplication operations in this finite field 
are performed very efficiently by modern computers since the 
processors are optimized for integer operations. The second 
(and the main) interest is that, unlike finite field of cardinal 
2 m , these finite fields support Fast Fourier Transform (FFT)- 
like algorithm with complexity O(nlogn) allowing very fast 
polynomial evaluations/interpolations. This algorithm, called 
Fermat Number Transform (FNT), uses the divide and conquer 
approach |10| of FFT by replacing the n th roots of unity in 
the complex field by a n th root of unity in this finite field. 

Even if the definition of Reed-Solomon codes in such field 
was already proposed ifTTI . our work is, at our knowledge, the 
first complete description of the different steps of the encoding 
and decoding algorithms expressed in terms of FNT with an 
accurate analysis of the complexities. In particular, the use of 
recent results on polynomial multiplication [12| allows us to 
reduce significantly the practical complexity (see Section [H). 
We show that the complexity of encoding and decoding are 
0(fclog 2 k + nlogn) for the symbol erasure channel. Since 
the term k log 2 k only concerns the location of the erasure in 
a codeword, these operations are only done once per packet 
for a network transmission, or, once per file chunk for a peer- 
to-peer network. It follows that the effective complexity is 
0(n log n), with a low factor (see details in section III I. These 



results are confirmed by an analysis of the performance of our 
codec which is compared to other available MDS codecs (see 
Section [TVl. 



Another benefit of our approach is that the connection of 
erasure codes with FNT could allow to re-use the optimized 
software and hardware implementations |fl3l developed for the 
applications of FNT in various domains like multi-precision 
multiplication, audio or image filtering. 



ii. fnt-based encoding/decoding scheme in a 
Fermat Field 

A. The Fourier Transform over GF(q) 

Assuming that q is prime, the values 0, l,...,q — 1 form a 
finite field where addition and product are processed modulo 
q. This finite field is known as the Galois field GF(q) ifTTI . 

In finite fields, the order of a number is defined as the lowest 
power of the number that equals 1 modulo q. An element of 
the field is called a primitive root of the field if its order is 
q — 1 [14|. For exemple, 3 is a primitive root in GF(65537), 
because 3 65536 = 1 mod 65537 and for each < i < 65536, 
3 4 ^ 1 mod 65537. 

The principles of the Fourier transform can be extended 
to these finite fields, as introduced by Pollard lfT31 . Let r 
be an element of order n — 1 in the field. In this case, 
the discrete Fourier transform (DFT), which takes a vector 
a = (do, a„_i) of size n as input in GF(q), returns: 

Aj = air 1 -' , < j < n — 1, n < q 

where A is a vector of size n. In the same way, the inverse 
DFT (DFT -1 ) can be defined as: 



1 ^ 

a-: = — x 
n 



i=0 



A, 



, < j < n — 1, n < q. 



Fermat Number Transform is used in finite fields in order 
to process the DFT. FNT is the equivalent of the FFT algo- 
rithm on Fermat fields [10|. Thanks to a divide and conquer 
approach, the complexity of the FNT of size n and its inverse, 
FNT -1 , reduces the process of the DFT to 0(n\ogn). 

The FNT can also be represented by a n x n square matrix. 
This matrix is a special case of a Vandermonde matrix on 
special set 1, r, r 2 , ...r n , where r is of order n — 1. Since 
these values are pairwise distinct, the matrix is invertible. The 
first k rows of this FNT matrix form the generator matrix of 
a Reed-Solomon code fl4l . 

Another way to represent the FNT comes from polynomials. 
Indeed, if a represents a vector of size n, we define a(x) as 
the polynomial J^i^o °>i x% - Thus, the FNT can be viewed as 



the evaluation of a(x) on the points 1, r, '* 



^ = 



a(r 3 ), < j < n — 1, which form a geometric sequence. 

B. Design of the code on Gi 7 '(65537) 

In this section, we will present the FNT over GF(65537) 
which allows to treat all codewords as 16-bit values, except 
65536. This case can be treated easily, by sending the positions 
where this value is used in the encoding symbols header, if 
relevant. 

As we have seen previously, 3 is of order 2 16 . Thus, FNTs 
of size n — 2 l can be produced with the root r = 3 

The objective is to encode k source symbols to create n 
encoding symbols, k < n < 65537. As stated previously, the 
FNT matrix represents an MDS code for each couple (n, k). 



For the sake of simplicity, we will take the hypothesis that k 
is a power of 2, without losing generality. The case when k is 
not a power of 2 can be treated as well, with padding on the 
FNTs that will be used. In the same manner, we take also n 
as a power of 2. The case where n is not a power of 2, can 
be seen as a puncturing of the code. 

Let s = (sq, Si, s^-i) be the source vector of 
size k and its associated polynomial s(x). Let e = 
(eo, ei, e„_i) be the encoded vector of size n. The encod- 
ing step consists in simply applying the FNT on the vector 
(so, Si, Sk-i, 0, 0) of size n. If r represents the root of 
unity for the FNT of size n, we have also: 

ei = s(r i ), < i < n - 1 

On the decoder side, some of the information in e could 
have been lost. If not, the decoding is simply a product of e 
by the FNT -1 matrix. The decoded symbols are then the first 
k symbols. However, in most cases, at least one of the encoded 
symbols can have been lost. As we are dealing with an MDS 
code, decoding is possible if at least k encoded symbols have 
been received. From a polynomial view, it means that at least 
k points of evaluation have been received for a polynomial 
of degree k — 1, which is a classical interpolation problem. 
This field has been heavily studied for hundreds of years and 
many algorithms have emerged to solve this problem |16|. 
Historically, Lagrange and Newton polynomial interpolation 
have been the most widespread solutions for this problem. 
However, in our case, the main disavantage of these algorithms 
is their quadratic complexity. 

Let (xq,xi, ...,Xk-i) be the first k evaluation points re- 
ceived by the decoder. In fact, this vector represents a sub- 
sequence of the geometric sequence 1, r, r 2 , ...r n . The de- 
coder has then received the values (s(xq), s(xi), s(xk-i))- 
Let this vector of size k be called v. The Lagrange interpolat- 
ing polynomial P is then: 



fe-i 



P(x) = £ U X JJ 



0<j<k-l ' 

where the coefficients of P are the coefficients of the source 
symbol s. Using the following definitions: 



A{x) = ]J (x - X-) 



and 



A{x) 



we can see that for i = 0...k — 1, A(xi) = and for each 
j ^ i, Ai(xj) — 0, the Lagrange interpolating polynomial can 
also be rewritten as following: 
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p(z) = E'' 



Ajjx) 



A(x) x 



i=0 



Let us define the derivative of the polynomial A: 



Let A (x) = X]i=o a i a;l ' k < n, and for i = - 0, 2n— 2, let 
ti = — /2, and the sequence bi = r ti . For i = 0, n— 1, 
let Ci be defined by c, = a i jbi. One may notice here that 
h+i = <fbi, and then we have: 
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A' = fll = EIL 



*=0 j/i 
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i=0 



then one can remark that Ai(xj) = if i ^ j, meaning 
that for each i = 0,1,..., fc — 1 we have Aj(:Ci) = A (xi). 
This result is crucial as it means that we can use directly 
the derivative of A in order to process all the Ai(xi) of the 
interpolating polynomial. 

The algorithm for processing the interpolating polynomial, 
and then, recovering the source symbols is: 

1) Calculate the polynomial A(x) 

2) Derivate the polynomial to obtain A (x) 

3) Evaluate A (x) on (xo, X±, x^-i) 

4) Process all the Vi/A^xA in rti 

5) Calculate Y]*" 1 

6) As the denominator of the previous fraction is A(x), the 
polynomial P(x) is directly its numerator. 

We will detail further the complexity of each item. Most of 
these items depend heavily on polynomial products. Hence, 
the design of an efficient method for processing the product 
of two polynomials will directly affect the complexity of 
the interpolation algorithm. In the following development, we 
will introduce M(n), which will correspond to the cost of 
multiplicating two polynomials of degree strictly lower than 
n. Let us detail the complexity of M(n), 

The classic polynomial product is unusable here because 
of its quadratic complexity. Using the fact that some 
products can be replaced by sums of already processed 
terms, Karatsuba [17) has reduced the complexity bound to 
0(n l ° 923 ) ~ 0(n 159 ). This concept has been generalized by 
Toom and Cook. Toom-fc algorithms can lead to a complexity 

!og(2k-l) 

of 0(n ! °s( fc ) ). However, the constant factor hidden in 
the big-0 notation, which fastly grows with k, prevents its 
practical use for k > 4. Our approach, is based on the well- 
known Schonhage-Strassen algorithm iTHl . This algorithm 
relies on FFTs to process the product of two polynomials. 
When all roots of unity are available, like in Fermat fields, the 
complexity of this algorithm is the complexity of processing 
3 FFTs (FNTs in our case). It results that the complexity of 
polynomial multiplication is, in our case M(n) = 0(n\ogn). 

In step 1, the product of k polynomials of degree 1 has to be 
processed. The resulting polynomial A(x) is then a polynomial 
of degree k. Using a divide and conquer method, the cost of 
step 1 is 0(M(k) logfc). Step 2 and step 4 have obviously a 
linear cost 0(n). 

In step 3, the problem is to evaluate a polynomial of degree 
k — 1 on k points. At this point, it may be noticed that 
the evaluation points form a sub-sequence of a sequence in 
geometric progression. It does mean that if the evaluation of 
the polynomial A is known on the geometric sequence of 
factor r, the knowledge on the evaluation points is then direct. 



A (r«) 



n-l 



3=0 



J=0 



From this statement, we can see that the values of A (r 1 ) can 
be viewed as the coefficients of the degrees n— 1, n, 2n — 2 
of the polynomial product of J27=o a^™ -1-1 by J2i=o 2 ^>i x% - 
Step 3 of the interpolation algorithm can be reduced to the 
product of two polynomials, and most importantly to the 
coefficients located in middle positions. The cost of step 3 
is then 0(M{n)) operations. 

As the last step of the algorithm is trivial, the only remaining 
complexity to determine is step 5. In this step we have to 
determine the numerator of the following sum: 



Pi?) 
A(x) 
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E 

i=0 



mod x 7 ' 



with the rii already determined in step 4. Using the Taylor 
series of l/(xi — x) = J^j we can rewrite the sum 

as: 



fc-i 

E 

i=0 



mod x n 




X" J 2 " J 



Then, we will also use the fact that the sequence xi is a 
sub-sequence of a sequence in geometric progression. We can 
then define each Xi as a power of the root, Xi — r Zi with 
z; < n. Then the sum becomes: 



fc-i 

En. 
— 



i=0 



mod x n = -Y J N'(r- j - 1 )^ 



3=0 



where N (x) is the polynomial Y^^o n i% Zi - Since the 



are in geometric progression, we 



points 1 , r 1 , 

can conclude that the complexity of step 5 is also 0(M(n)). 

The decoding cost of the DFT-based code with this algo- 
rithm, in the general case, is 0(fclog 2 k + nlogn) operations, 
with k < n. 

It may be noticed that steps 1 to 3 do not depend on 
the received values but only on their positions. It means 
that if the received positions are known to be static, these 
steps have to be processed only once. In the case of data 
transmission, data units can be UDP or RTP packets in the 
case of network transmission, or file chunks in the case of 
peer-to-peer networks. In the first case, the size of the packets 
can be up to 1500 bytes, and in the second case up to 
several megabytes. This means that the positions are static for 
hundreds to millions symbols, which are two bytes long here, 



and thus, the complexity of these steps, mainly 0(klog 2 k) 
can be neglected compared to the other steps. 

In conclusion, in practice, the complexity of the decoding 
algorithm falls to O(nlogn). 

C. Toward a systematic code 

In network transmissions, it is often essential to have a 
systematic code, i.e. a code where the source data units 
are included in the encoded data units. Using the same 
algorithm, it is possible to design a systematic code. Let 
9 = (so, Si, ...Sk-x) be a source vector of size k. Interpolating 
these values with the previous algorithm on the first k posi- 
tions will lead to the "decoding" of k intermediate symbols, 
i = (io,ii, ...,ifc_i). Then producing the DFT of the vector i 
with the same fast algorithm as the non-systematic case will 
produce the encoded vector e where the first k positions will 
be the systematic positions. 

On the decoder side, if some source symbols (i.e. systematic 
positions) are missing, the same decoding as before on the 
first k received positions is applied. This allows to recover the 
vector of intermediate symbols i. Then, the same DFT as the 
one used during encoding is applied to recover the full encoded 
vector e whose first k positions are the source symbols. 

The practical complexity of a systematic version remains 
0(n log n), however with a greater constant term. 

III. Complexity analysis 

In this section, we will detail the complexity of the encoder 
and the decoder in both non-systematic and systematic coding, 
in terms of numbers of FNT of size n. The reason is that 
the big-O notation hides the constant term which has a major 
practical importance. [[] 

First of all, the complexity of the interpolation algorithm 
has to be discussed, and more precisely step 5, as it is in 
practice, the most costly step. In step 5, we first evaluate 
the polynomial TV on n positions. As we have seen before, 
this evaluation is equivalent to the multiplication of two 
polynomials of respectively degree n — 1 and 2n — 2. As we 
stated in the beginning, n is a power of two, meaning that 
the product is based on FNTs of size An. As seen in section 
|H] the only useful coefficients are the middle ones, so we can 
apply the Middle Product algorithm (MP) fl2l . which helps 
to reduce the FNTs size to 2n. The latter element of step 
5, is to multiply the resulting sum by A(x). Since k < n, 
the product of the sum by A(x), whose only the first k 
coefficients are kept, is always of degree strictly greater than 
n but lower than 2n. To summarize, the cost of interpolation, 
in terms of FNT n , is 3FNT 2 „+FNT 2 „ = 4FNT 2 „. As the 
FNT has a logarithmic cost, for reasonable sizes, we can say 
that FNT2n — 2FNT„, meaning that the interpolation cost is 
roughly 8FNT„. 

For the encoder, we propose three algorithms, 

'it is also worth noticing that classic quadratic encoding and decoding 
schemes are also available for these codes. 



• The FNT-based algorithm described above for both sys- 
tematic and non-systematic case; 

• The direct encoding of the n-k non-systematic symbols by 
a quadratic complexity algorithm in the systematic case; 

• The processing of the k intermediate symbols with a 
direct product and then a FNT, in the systematic case; 

On the decoder side, we will use two algorithms, 

• The FNT-based interpolation algorithm described above 
for both systematic and non-systematic case; 

• The direct decoding of the missing symbols; 

The complexities are summarized in Table [j] and Table [II] 

Non-Systematic Encoder Systematic Encoder 
FNT-based " lFNT n 9FNT ra 
Direct Encoding - 0((n-k).k) 
Direct Interm. & FNT - 0(k 2 )+FNT„ 



TABLE I 

Complexity of the encoder for both non-systematic and 
systematic case 





Non-Systematic Decoder 


Systematic Decoder 


FNT-based 


8FNT n 


9FNT n 


Direct Decoding 




<0((n-k).k) 



TABLE II 

Complexity of the decoder for both non-systematic and 
systematic case 



For the systematic case, direct encoding and decoding 
methods are more adapted than FNT-based for small sizes. 
However, the last method for the encoder is well suited to 
small rate codes, i.e. k<<n. 

IV. Simulation results 

We have implemented [ 19 1 a FNT-based algorithm, with the 
encoders and decoders described above. For k<256, quadratic 
complexity algorithms are used. Karatsuba and quadratic al- 
gorithms are used for polynomial products of degree less than 
64. We have tested also the Reed-Solomon implementation 
by Rizzo [5|, named Vandermonde implementation, and the 
XOR-based implementation [7|, at their optimal tuning. It 
is important to notice that compared to the other RS algo- 
rithms presented here, FNT-based codes do not require any 
tuning on the field size. The codes were tested on a Intel 
Core 2 Extreme® 3. 06Ghz on Mac OS X 10.5 with 64-bit 
compilation. Fig. [Tj provides the encoding speed for the three 
algorithms on a systematic case, and also the speed of non- 
systematic code for FNT-based algorithm, for a coderate 1/2. 
Fig. [2] shows the corresponding decoding speed. The speed of 
the non-systematic case is not represented in this figure, as it 
is slightly the same as that of the systematic case, as seen in 
Table EU 

In the systematic scenario, FNT-based algorithms operates 
as fast as Vandermonde and XOR-based implementations for 
small sizes. The underlying algorithm is roughly equivalent to 




1 Mbps t 1 1 < 1 ' ' 1 1 1 1 1 

16 32 64 128 256 512 1024 2048 4096 8192 16384 

Vandermonde -*-XOR-based FNT-based - syst. -*-FNT-based - non-syst. 



Fig. 1. Encoding speed for Vandermonde, XOR-based and FNT-based 
encoders for different source numbers 



1000 Mbps 



100 Mbps 



10 Mbps 



1 Mbps 




■Vandermonde -■-XOR-based FNT-based 



Fig. 2. Decoding speed for Vandermonde, XOR-based and FNT-based 
decoders for different source numbers 



the other ones, but performs on integer computations, instead 
of binary polynomials. 

However, as soon as the source size achieves hundreds of 
symbols, the logarithmic complexity algorithm of the FNT- 
based code allows a vast improvement compared to the other 
codes. In addition, the speed of the encoder/decoder decreases 
only slow, and speeds over 10Mbps are achieved with more 
than 10,000 source symbols. 

For the non-systematic case, the encoder is about 10 times 
faster than the systematic one. This is in line with the 
complexity analysis. 

V. Conclusion and future work 

This paper presents the first implementation of an MDS 
code whose encoding and decoding complexity is O(nlogn). 
These results open the way for the practical use of MDS codes 
for sizes of thousands elements. Moreover, the speed of the 
encoder, which is roughly the speed of a FNT, allows many 
direct applications. Among them, distributed data backup and 
peer-to-peer networks are areas where the encoding speed is 
critical, as well as the number of nodes implicated. With prac- 
tical speeds of over 100Mbps, FNT-based coding is answering 
this issue, while offering a practically usable decoding speed. 



In deep space applications, the encoding cost could also benefit 
to low power systems. 

As the speed of the algorithm heavily depends on the FNT, 
many ways can be explored for its optimization. Indeed, the 
FNT, which is a special case of FFT could benefit from par- 
allel computing (multi-threading, GPU computing). Another 
interesting point is the re-use of existing optimized software 
and hardware implementations |20||13|. 
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